NSF PAR Search | NSF Public Access Repository

PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

Gong, Albert; Stankeviciute, Kamile; Wan, Chao; Kabra, Anmol; Thesmar, Raphael; Lee, Johann; Klenke, Julius; Gomes, Carla P; Weinberger, Kilian Q (July 2025, Proceedings of Machine Learning Research)

High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique and factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities.
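The idea of generating a fresh, internally consistent corpus with matching question-answer pairs for each evaluation can be illustrated with a toy sketch. This is not the PhantomWiki pipeline: the function name `generate_instance`, the name format, the job list, and the single question template are all hypothetical choices made for illustration only.

```python
import random


def generate_instance(seed, num_people=6):
    """Build a toy fictional corpus and matching QA pairs from a seed (illustrative only)."""
    rng = random.Random(seed)
    # Synthetic names (hypothetical format); the index suffix keeps them unique.
    names = [f"Person{rng.randrange(10_000):04d}-{i}" for i in range(num_people)]
    jobs = ["baker", "pilot", "welder", "nurse"]

    # Sample the underlying facts first, so articles and answers agree by construction.
    facts = {}
    for i, name in enumerate(names):
        # Parents are drawn from earlier people only, which keeps the relation acyclic.
        parent = rng.choice(names[:i]) if i > 0 else None
        facts[name] = {"job": rng.choice(jobs), "parent": parent}

    # Render one short article per person from the sampled facts.
    corpus = {}
    for name, info in facts.items():
        text = f"{name} works as a {info['job']}."
        if info["parent"] is not None:
            text += f" The parent of {name} is {info['parent']}."
        corpus[name] = text

    # Two-hop questions: look up the parent, then look up the parent's job.
    questions = []
    for name, info in facts.items():
        if info["parent"] is not None:
            answer = facts[info["parent"]]["job"]
            questions.append((f"What is the job of the parent of {name}?", answer))
    return corpus, questions
```

In this sketch, `num_people` loosely plays the role of corpus size and the number of hops in the question template loosely plays the role of question difficulty; calling `generate_instance` with a new seed yields a new corpus that was never published anywhere.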
Full Text Available
Latent Diffusion for Language Generation

Lovelace, Justin; Kishore, Varsha; Wan, Chao; Shekhtman, Eliot; Weinberger, Kilian Q. (December 2023, Advances in neural information processing systems)
Oh, Alice; Naumann, Tristan; Globerson, Amir; Saenko, Kate; Hardt, Moritz; Levine, Sergey (Ed.)
Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have presented diffusion as an alternative to existing pretrained language models. We view diffusion and existing language models as complementary. We demonstrate that encoder-decoder language models can be utilized to efficiently learn high-quality language autoencoders. We then demonstrate that continuous diffusion models can be learned in the latent space of the language autoencoder, enabling us to sample continuous latent representations that can be decoded into natural language with the pretrained decoder. We validate the effectiveness of our approach for unconditional, class-conditional, and sequence-to-sequence language generation. We demonstrate across multiple diverse data sets that our latent language diffusion models are significantly more effective than previous diffusion language models. Our code is available at https://github.com/justinlovelace/latent-diffusion-for-language.
Full Text Available
IncDSI: Incrementally Updatable Document Retrieval

Kishore, Varsha; Wan, Chao; Lovelace, Justin; Artzi, Yoav; Weinberger, Kilian Q. (July 2023, Proceedings of the 40th International Conference on Machine Learning)
Krause, Andreas; Brunskill, Emma; Cho, Kyunghyun; Engelhardt, Barbara; Sabato, Sivan; Scarlett, Jonathan (Ed.)
Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks. However, they have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
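The abstract does not spell out the constraints of the optimization problem, so the following is only a simplified sketch of the general idea of adding a document through a small constrained update. It assumes (hypothetically) that each document is scored by a dot product between a per-document vector and a query embedding, and it solves a single-constraint problem in closed form: find the minimum-norm new vector whose score for the new document's query beats every existing document by a margin. The function name `add_document` and the `margin` parameter are illustrative, and the published method may use different and additional constraints.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def add_document(doc_vectors, query_embedding, margin=1.0):
    """Return a minimum-norm vector w with w . q >= max_j (w_j . q) + margin.

    Existing document vectors are left untouched; only one new vector is created.
    This is an illustrative simplification, not the IncDSI algorithm itself.
    """
    q_norm_sq = dot(query_embedding, query_embedding)
    if q_norm_sq == 0:
        raise ValueError("query embedding must be non-zero")

    # Highest score any existing document gets for this query.
    best_existing = max((dot(w, query_embedding) for w in doc_vectors), default=0.0)
    target = best_existing + margin

    if target <= 0:
        # The zero vector already satisfies the constraint and has minimal norm.
        return [0.0] * len(query_embedding)

    # For a single linear constraint w . q >= target with target > 0, the
    # minimum-norm solution lies along q: w = (target / ||q||^2) * q.
    scale = target / q_norm_sq
    return [scale * x for x in query_embedding]
```

Because the update is a closed-form computation over one new vector, its cost does not depend on retraining over the existing corpus, which is the property the abstract emphasizes.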
Full Text Available